SUMMARY


In this paper, we present a survey of some of the most
commonly used nonparametric* methods of survival curve
estimation. This is provided as background to the
discussion of a relatively recent technique which can be
applied to cross-sectional data. The method, known as Cox
regression, has been used extensively in the biomedical
sciences for relating covariates to patient survival, but
to our knowledge has never been applied to military
manpower problems. A prime objective of this paper is to
examine the potential usefulness of this procedure by
applying it to the 1973 recruit cohort of four-year
obligors. This data set has been analyzed previously by
means of a probit analysis, and the results are used as a
basis for comparison with the Cox procedure.

A favorable comparison of the Cox procedure with probit
analysis on a longitudinal data base (the 1973 cohort)
will be regarded as evidence of its potential application
to cross-sectional data, to which probit analysis cannot
be applied. Furthermore, an analysis using the Cox model
is computationally more efficient than its probit
counterpart, and its results are more easily
interpretable. On the evidence presented in this paper,
the Cox model appears to be a very useful procedure for
estimating recruit survival from cross-sectional data.


*The term "nonparametric" means that no assumptions
are made about the mathematical form of the
distribution of the data.
INTRODUCTION


Extensive analyses have been done by CNA which relate
various pre-service and in-service personnel
characteristics to the probability of surviving to a given
point in time (usually the end of the first year or
enlistment term). Statistical techniques which have been
employed for identifying these relationships include
probit and logit analysis (see references 1 and 2 for
discussions of these methods). These techniques require a
sample or population of individuals followed from the day
they enter the Navy until they either leave or complete
their first term (longitudinal data). Thus if we are
considering 4-year obligors, for instance, it will be
necessary to follow a cohort through 4 years of service.

In  order  to  avoid  following  individuals  for  such  a  long 
period  of  time,  it  has  been  proposed  that  cross- 
sectional  data  be  used  for  estimation  of  survival.  A 
cross-sectional  data  set  is  formed  by  selecting  all 
those  individuals  who  are  currently  enlisted  in  the 
Navy.  This  provides  an  enormous  and  potentially  very 
informative  data  base.  However,  the  statistical 
methods  mentioned  above  cannot  be  used  to  exploit  this 
type  of  data,  and  therefore  a  different  method  must  be 
employed. 

The statistical technique which we propose to use for
handling cross-sectional data is termed a Cox regression
model (see reference 3). This model is used extensively
in the biological and health sciences for relating
covariates to survival of patients, but to our knowledge
has never been applied to military manpower problems. The
Cox model has the advantage of being able to generate a
continuous survival curve rather than just a point
estimate (which a probit analysis gives). It should be
noted that an approximation to a continuous survival curve
can be obtained by applying a probit analysis in a
sequential manner, e.g., at monthly intervals, but only
with a great deal of computational time and expense. The
Cox model, on the other hand, can generate a survival
curve at only a fraction of the time and cost.

The Cox model has the additional advantage that it can
be applied to cross-sectional data, but it can, of
course, also be applied to longitudinal data. Thus we
can compare the relative estimating abilities of probit
and Cox on data from one cohort and, from the results,
make recommendations as to the efficacy of using the
Cox procedure in the future. We shall base the
comparisons on the 1973 recruit cohort of 4-year obligors.
Separate analyses will be performed for GENDETs and
A-schoolers classified according to mental group and
education, with their age, race, and primary dependents
held constant.
LONGITUDINAL VERSUS CROSS-SECTIONAL DATA


The  advantages  and  disadvantages  of  using  longitudinal 
and  cross-sectional  data  are  summarized  below. 

•  Longitudinal data

-  Advantages:

All individuals are observed until completion
of term or attrition.

All individuals start in the same year and are
subject to the same patterns of attrition.

-  Disadvantages:

Data must be collected and followed for 4
years.

Analyses based on these data may no longer be
current, i.e., present attrition patterns may
be different from those observed 4 years ago.

•  Cross-sectional data

-  Advantages:

Attrition patterns observed are the most
current.

The data need only be followed for a relatively
short period of time (say 1 year).

-  Disadvantages:

Not all individuals are observed until
completion of term or attrition.

The data consist of a mixture of different
cohorts.

Only the portion of a cohort still in the Navy
at the time of data collection is observed.

The last disadvantage listed for cross-sectional data
can present considerable problems of bias (in estimating
survival probabilities) if one is not careful with the
analysis. For example, if the data consist of $n_1$
individuals with 1 year of service and $n_2$ individuals
with 2 years of service at the time of data collection and
the data were followed for 1 year, then a greatly inflated
estimate of the probability of surviving 1 year could be
obtained: each of the $n_2$ individuals survived one year,
but the remainder of their cohort who left before this
time has not been included in the sample. This difficulty
can be overcome by performing a conditional analysis. If
we let $P(T \ge i)$ be the probability that an individual
survives at least $i$ years, $i = 1, 2$, then we may write

$$P(T \ge 2) = P(T \ge 2 \mid T \ge 1) \times P(T \ge 1). \qquad (1)$$

The probability of surviving 1 year may then be estimated
by considering only those who start their service at the
time of data collection and observing their status at the
end of 1 year. We can also estimate the conditional
probability of surviving 2 years, given that 1 year has
already been completed, by considering only the $n_1$
individuals and observing their status after 1 year of
follow-up. The probability of surviving 2 years may then
be obtained from expression (1). The same logic may of
course be extended to more than 2 years. Thus the
potential biases due to using cross-sectional data can be
eliminated by a careful selection of subsets of the
population at each point of time to be estimated.
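
The conditional argument behind expression (1) can be sketched in a few lines of Python. This is only an illustration: the function name `chained_survival` and the counts are hypothetical and are not taken from the paper's data. Each group is the set of individuals who had completed a given number of years at data collection, followed for one further year.

```python
def chained_survival(conditional_groups):
    """Multiply one-year conditional survival rates, as in expression (1).

    conditional_groups[i] is a pair (at_risk, survivors): the number of
    individuals who had completed i years of service at data collection,
    and how many of them were still in service one year later.
    """
    survival = 1.0
    curve = []
    for at_risk, survivors in conditional_groups:
        survival *= survivors / at_risk   # P(T >= i+1 | T >= i)
        curve.append(survival)            # P(T >= i+1)
    return curve


# Hypothetical counts: new entrants, then those with 1 year of service.
curve = chained_survival([(1000, 800), (600, 540)])
print(curve)
```

Only individuals observed from the start of each year of service enter the corresponding ratio, which is what removes the bias described above.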

Another feature of cross-sectional data is that after
the period of follow-up there will be individuals who
are still in their first term of enlistment, i.e., they
have neither completed their first term nor left. Data
such as these are termed censored observations in the
biostatistical literature. Thus, if we are to use
cross-sectional data for making inferences, we must
use statistical methods which take censored data into
account.
NONPARAMETRIC ESTIMATION OF A SURVIVAL
CURVE WITHOUT COVARIATES

Methods for handling censored observations for the
estimation of survival curves are well documented, both
parametric and nonparametric. We shall concern ourselves
here, however, with the discussion of only the
nonparametric methods.

A summary of the most commonly used nonparametric methods
for estimating survival probabilities is presented
below. Also given are the types of data to which the
methods apply.

•  Empirical  Distribution  Function  (EDF)  -  no 
censored  observations. 

•  Kaplan-Meier or Product-Limit (P-L) - generalization
of EDF to censored data.

•  Life Table - grouped data, interval counts; can
handle censored data.

This is not an all-inclusive list, but other methods for
estimating survival curves usually involve only
modifications of those listed above.

EMPIRICAL  DISTRIBUTION  FUNCTION 

The Empirical Distribution Function (EDF) gives the
nonparametric maximum likelihood estimate of the true
underlying continuous distribution function F. If we
observe data $X_1, X_2, \dots, X_n$ which are independently
and identically distributed (i.i.d.), then the EDF is
defined as follows:

$$F_n(t) = (\#\text{ of } X_j\text{'s} \le t)/n \qquad (2)$$

$$F_n(t) = \frac{1}{n} \sum_{j=1}^{n} I(X_j - t), \qquad (3)$$

where

$$I(X_j - t) = \begin{cases} 1 & \text{if } X_j - t \le 0 \\ 0 & \text{if } X_j - t > 0. \end{cases}$$
From expression (3) we see that the EDF is just a step
function with jumps of 1/n at each of the n order
statistics (i.e., the ordered values of $X_1, \dots, X_n$).
Also from expression (3) we have that

$$E[F_n(t)] = \frac{1}{n} \sum_{j=1}^{n} E[I(X_j - t)]$$

$$= \frac{1}{n} \sum_{j=1}^{n} P(X_j \le t)$$

$$= \frac{1}{n} \sum_{j=1}^{n} F(t) \quad \text{(since the } X_j \text{ are i.i.d.)}$$

$$= \frac{1}{n} \cdot n F(t)$$

$$= F(t),$$

so that the EDF is seen to be an unbiased estimate of
the true distribution function F. From the Strong Law
of Large Numbers (see reference 4, for example), we
also have that $F_n(t)$ is a consistent estimator for
$F(t)$; i.e., for each value of t,

$$P\left\{ \lim_{n \to \infty} F_n(t) = F(t) \right\} = 1.$$

Thus as the number of observations becomes large,
$F_n(t)$ approximates a smooth function and approaches
the true distribution function $F(t)$.
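
As a minimal sketch of expressions (2) and (3), the EDF can be computed by counting the observations at or below t. The function name `edf` and the sample values are hypothetical, chosen only for illustration.

```python
def edf(sample, t):
    """Empirical distribution function: fraction of observations <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)


# Hypothetical sample of four uncensored service times.
sample = [2.0, 5.0, 3.0, 9.0]
values = [edf(sample, t) for t in (1.0, 2.0, 4.0, 9.0)]
print(values)
```

With four observations the function rises in steps of 1/4 at each ordered value, which is the step-function behavior described above.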

PRODUCT-LIMIT  ESTIMATOR 

The Product-Limit (P-L) estimator generalizes the EDF to
handle censored observations. It was derived by Kaplan
and Meier (reference 5) and was shown to be the
nonparametric maximum likelihood estimate of the true
survival function in the presence of censoring. To define
the P-L estimator, we consider the situation where
failure times are observed at k distinct points
$0 < t_1 < t_2 < \dots < t_k$ with $n_j$ failures
occurring at each $t_j$. This allows for "discretizing" a
continuous distribution; for example, when service time is
measured in months there will be many individuals with the
same length of service. Now we also observe $m_j$ censored
observations in the interval $I_j = [t_{j-1}, t_j)$,
$j = 1, 2, \dots, k+1$, where $t_0 = 0$ and
$t_{k+1} = \infty$.

Kaplan and Meier showed that the maximum likelihood
estimator of the survival function puts probability
mass not at any censored observations, but only at the
observed failure times. The censored observations do
play a role, however, in determining what the
probabilities at the failure times will be.


Now let

$$r_j = \sum_{i=j}^{k} (n_i + m_{i+1}), \qquad j = 1, \dots, k.$$

This may be seen to be the number of observations that
occur at times greater than or equal to $t_j$. Then if we
let $p_j = P(T > t_j \mid T > t_{j-1})$, the maximum
likelihood estimate of $p_j$ is

$$\hat{p}_j = \frac{r_j - n_j}{r_j}, \qquad j = 1, \dots, k. \qquad (4)$$

The estimate $\hat{p}_j$ is just the number of observations
strictly greater than $t_j$ divided by the number
greater than or equal to $t_j$. Note that this estimate
of $p_j$ utilizes the censored observations beyond
time $t_j$ since they have definitely survived
the interval $I_j = [t_{j-1}, t_j)$, but essentially
ignores those censored observations which occur within
$I_j$. By virtue of ignoring them, we are, in effect,
assigning proportion $\hat{p}_j$ of them as survivors of
the interval.

From (4) we may define the P-L estimator of
$S(t) = P(T > t)$ as

$$\hat{S}(t) = \begin{cases}
1 & \text{if } 0 \le t < t_1 \\
\prod_{i=1}^{j} \hat{p}_i & \text{if } t_j \le t < t_{j+1},\ j = 1, \dots, k-1 \\
\prod_{i=1}^{k} \hat{p}_i & \text{if } t_k \le t < s^* \\
\text{undefined} & \text{for } t \ge s^*
\end{cases} \qquad (5)$$

where $s^*$ represents the last observed data point if it
is censored. Since the P-L estimator puts all its
probability mass at the failure times, no additional
mass is placed in the interval $t_k \le t < s^*$ other
than at $t_k$. Although $\hat{S}(t)$ is undefined for
$t \ge s^*$, $\prod_{i=1}^{k} \hat{p}_i$ is, of course, an
upper bound for $\hat{S}(t)$ for $t_k \le t < \infty$.


For  an  illustration  of  the  P-L  estimator,  consider  the 
following  hypothetical  data: 

failure  times  at  0.8,  3.1,  5.4,  9.2 
censored  times  at  1.0,  2.7,  7.0,  12.1. 


For this situation, we have k=4 and single observations
at each failure point, i.e., $n_i = 1$, $i = 1, \dots, 4$.
The numbers of censored observations in the intervals
between failure points are respectively $m_1 = 0$,
$m_2 = 2$, $m_3 = 0$, $m_4 = 1$, and $m_5 = 1$. From this,
we compute

$r_1 = 8$,   $r_1 - n_1 = 7$,   $\hat{p}_1 = 7/8$

$r_2 = 5$,   $r_2 - n_2 = 4$,   $\hat{p}_2 = 4/5$

$r_3 = 4$,   $r_3 - n_3 = 3$,   $\hat{p}_3 = 3/4$

$r_4 = 2$,   $r_4 - n_4 = 1$,   $\hat{p}_4 = 1/2$
from which we derive

$$\hat{S}(t) = \begin{cases}
1, & 0 \le t < 0.8 \\
7/8, & 0.8 \le t < 3.1 \\
7/10, & 3.1 \le t < 5.4 \\
21/40, & 5.4 \le t < 9.2 \\
21/80, & 9.2 \le t < 12.1
\end{cases}$$

from expression (5). Note how the estimates would differ
if we had simply ignored the censored data. In that case,
the empirical survival function would give us decrements
of 0.25 at each failure time (the empirical survival
function is just 1-EDF).
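
The hypothetical example above can be checked with a short Python sketch of expressions (4) and (5). The function name `product_limit` is an illustrative choice, and exact fractions are used so the output can be compared directly with the values in the text.

```python
from fractions import Fraction


def product_limit(failures, censored):
    """Return [(t_j, S_hat at t_j)] for each distinct failure time t_j."""
    all_times = list(failures) + list(censored)
    estimate = Fraction(1)
    curve = []
    for t in sorted(set(failures)):
        r = sum(1 for x in all_times if x >= t)   # r_j: observations >= t_j
        n = list(failures).count(t)               # n_j: failures at t_j
        estimate *= Fraction(r - n, r)            # p_hat_j from expression (4)
        curve.append((t, estimate))
    return curve


failures = [0.8, 3.1, 5.4, 9.2]
censored = [1.0, 2.7, 7.0, 12.1]
curve = product_limit(failures, censored)
for t, s in curve:
    print(t, s)

# Ignoring the censored times reduces the estimate to 1 - EDF.
naive = product_limit(failures, [])
```

Passing an empty list of censored times reproduces the decrements of 0.25 mentioned above, which makes the contribution of the censored observations easy to see.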

LIFE  TABLE  ANALYSIS 

The Life Table method of survival curve estimation is
applied in situations when complete information on
survival data is not available. The data to which this
method applies take the form of interval counts, i.e.,
only the numbers of individuals who failed or were
censored in a given interval are known. Even if complete
information is available, however, it is often
more convenient to tabulate it in the form of a Life
Table.

To illustrate the Life Table method, we use the following
example taken from the Connecticut tumor registry. During
the years 1946-52, certain information was collected on
the Connecticut residents diagnosed as having cancer of
the kidney (the data were obtained from Zelen (reference
6)). For each individual, the date of diagnosis was
recorded. Each successive year it was noted whether (i)
the patient was dead, (ii) the patient was alive, or (iii)
the patient was lost to follow-up (LFU) during the 1-year
period. The term "lost to follow-up" means that an
individual cannot be observed past a certain point in time
because he failed to report to the hospital, moved to
another city, etc. The analogue of LFU for Navy manpower
data might be individuals who go AWOL, for instance.

For those who were diagnosed in 1946, we have the
information displayed in table 1 below.


TABLE 1

PATIENTS WITH CANCER OF THE KIDNEY DIAGNOSED IN 1946

                                            Censored observations
Interval        Number       Number      ---------------------------
(years after    alive at     died        Number lost    Number of
diagnosis)      start of     during      to follow-up   withdrawals
                interval     interval

0-1                 9            4             1              0
1-2                 4            0             0              0
2-3                 4            0             0              0
3-4                 4            0             0              0
4-5                 4            0             0              0
5-6                 4            0             0              4

In this table, the last two columns represent the
censored observations. The term "withdrawal" refers to
an individual who is still alive after the period of
observation (6 years in this case).

Since the period of observation went until 1952, we have
data for 6 intervals for the 1946 cohort group. However,
for the cohort group who were diagnosed in 1947 we
only have information from 5 intervals, and
correspondingly we have 1 less year of observation for
each later year of diagnosis. The information for the
period 1947-52 is given in table 2.

Suppose we wanted to estimate the probability of
surviving 6 years. Since only the 1946 cohort was
observed for 6 years, a possible estimate would be 4/8,
4/9, or 5/9 depending on whether or not we discard the
one person lost to follow-up, and if not, whether we
assume life or death. However, even though the other
cohort groups were observed for less than 6 years,
these data still contain useful information which we
would like to exploit. Tables 1 and 2 can be combined
and the data summarized as in table 3.
TABLE 2

PATIENTS WITH CANCER OF THE KIDNEY DIAGNOSED
FROM 1947 TO 1951

                                               Censored observations
                                             ---------------------------
Year of                  Number    Number    Number lost    Number of
diagnosis    Interval    alive     died      to follow-up   withdrawals

1947         0-1           18         7           0              0
             1-2           11         0           0              0
             2-3           11         1           0              0
             3-4           10         2           2              0
             4-5            6         0           0              6

1948         0-1           21        11           0              0
             1-2           10         1           2              0
             2-3            7         0           0              0
             3-4            7         0           0              7

1949         0-1           34        12           0              0
             1-2           22         3           3              0
             2-3           16         1           0             15

1950         0-1           19         5           1              0
             1-2           13         1           1             11

1951         0-1           25         8           2             15

TABLE 3

LIFE TABLE FOR PATIENTS WITH CANCER OF THE KIDNEY

                                         Censored observations
             Number      Number       ---------------------------
             alive       died during  Number lost    Number of
Interval     at start    interval     to follow-up   withdrawals

0-1            126          47             4             15
1-2             60           5             6             11
2-3             38           2             0             15
3-4             21           2             2              7
4-5             10           0             0              6
5-6              4           0             0              4



Let $A_i$ be the event of surviving the ith interval
and $E_t = A_1 \cap A_2 \cap \dots \cap A_t$ be the event
of surviving the years covered by the first t intervals.
Then we may write

$$P(E_1) = P(A_1)$$

$$P(E_2) = P(A_2 \mid E_1) P(E_1)$$

$$P(E_3) = P(A_3 \mid E_2) P(E_2) \qquad (6)$$

$$\vdots$$

$$P(E_t) = P(A_t \mid E_{t-1}) P(E_{t-1}).$$


Life Table analysis consists of estimating the
conditional probabilities given by (6) and, from these,
the probability of surviving t intervals. The only
complications are caused by the censored observations. In
order to utilize the censored data, we need to make
some distributional assumptions, although these
assumptions need only be very weak and do not
significantly detract from the nonparametric nature of our
estimates.

Before we state these assumptions, however, we need the
following definition. The hazard function h(t) is the
conditional probability of a failure in the interval
(t, t+dt), given survival to time t, i.e.,

$$h(t)\,dt = P(t < T < t + dt \mid T > t). \qquad (7)$$

By use of (7), the survival function can be expressed
as

$$S(t) = \exp\left\{ -\int_0^t h(x)\,dx \right\} \qquad (8)$$

so that knowledge of the hazard function implies
knowledge of the survival function.


We  are  now  ready  to  state  the  assumptions  behind  the 
Life  Table  method.  These  are: 


(i)  There is a constant hazard $h_j$ over each
interval $I_j$,


(ii)  The time of censoring in $I_j$ for each
censored observation is uniformly distributed
across the interval.


As long as we take our intervals $I_j$ to be
relatively small, these assumptions are quite reasonable.

If we denote the number of individuals alive at the
start of the ith interval by $n_i$, the number who
died during this interval by $d_i$, and the number
censored in the interval by $m_i$, then the maximum
likelihood estimates of $p_i = P(A_i \mid E_{i-1})$ are

$$\hat{p}_i = 1 - \frac{d_i}{n_i'}, \qquad i = 1, \dots, k, \qquad (9)$$

where

$$n_i' = n_i - \frac{m_i}{2}. \qquad (10)$$

The numbers given by (10) are called the effective
number of observations. In effect, half of the censored
observations in the interval $I_i$ are considered to
have survived the interval, while of the other half,
proportion $\hat{p}_i$ of them are assigned as survivors.
Censored observations beyond $I_i$ are, of course, all
survivors of that interval. Using (9), the Life Table
estimates of $S(i) = P(E_i)$ may then be obtained as in
table 4 below.


TABLE 4

LIFE TABLE ESTIMATES OF THE PROBABILITY OF SURVIVAL

            Number      Number
            alive at    died
            start of    during     Number     Estimate
Interval    interval    interval   censored   of $p_i$                      $\hat{S}(i) = \hat{P}(E_i)$

0-1         $n_1$       $d_1$      $m_1$      $\hat{p}_1 = 1 - d_1/n_1'$    $\hat{p}_1$
1-2         $n_2$       $d_2$      $m_2$      $\hat{p}_2 = 1 - d_2/n_2'$    $\hat{p}_1 \hat{p}_2$
...         ...         ...        ...        ...                           ...
(k-1)-k     $n_k$       $d_k$      $m_k$      $\hat{p}_k = 1 - d_k/n_k'$    $\hat{p}_1 \hat{p}_2 \cdots \hat{p}_k$


If we apply these results to the data of table 3, we
compute $n_1' = 116.5$, $n_2' = 51.5$, $n_3' = 30.5$,
$n_4' = 16.5$, $n_5' = 7$, and $n_6' = 2$, from which we
obtain the probability estimates given in table 5.


TABLE 5

LIFE TABLE ESTIMATES FOR THE DATA DISPLAYED
IN TABLE 3

Interval     $\hat{p}_i$     $\hat{S}(i)$

0-1          0.5966          0.5966
1-2          0.9030          0.5387
2-3          0.9345          0.5034
3-4          0.8788          0.4423
4-5          1               0.4423
5-6          1               0.4423



What degree of confidence can we place in the Life
Table estimates? First, it can be shown that the
estimates of S(i) are unbiased. In addition it can be
shown that the variance of $\hat{S}(i)$ is approximately

$$\mathrm{Var}[\hat{S}(i)] \approx [S(i)]^2 \sum_{j=1}^{i} \frac{1 - p_j}{n_j' \, p_j}, \qquad (11)$$

which clearly decreases as the original sample size
(and thus all $n_j'$) increases.

Of course, we do not know the true values of S(i) and
$p_j$ in expression (11), and so we substitute the
maximum likelihood estimates $\hat{S}(i)$ and $\hat{p}_j$
in their places.
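
The Life Table calculation can be sketched in Python using the counts of table 3. The function name `life_table` is illustrative. The standard-error column assumes the Greenwood-type form of the variance, with summand $(1 - \hat{p}_j)/(n_j' \hat{p}_j)$ and the estimates substituted for the true values; that form is an assumption made for this sketch.

```python
from math import sqrt

# (alive at start, died, lost to follow-up, withdrawals), from table 3.
table3 = [
    (126, 47, 4, 15),
    (60, 5, 6, 11),
    (38, 2, 0, 15),
    (21, 2, 2, 7),
    (10, 0, 0, 6),
    (4, 0, 0, 4),
]


def life_table(rows):
    """Return (effective n, p_hat, S_hat, std. error) for each interval."""
    survival, variance_sum, out = 1.0, 0.0, []
    for alive, died, lost, withdrawn in rows:
        effective = alive - (lost + withdrawn) / 2       # expression (10)
        p = 1 - died / effective                          # expression (9)
        survival *= p
        if p > 0:
            variance_sum += (1 - p) / (effective * p)     # assumed summand
        out.append((effective, p, survival, survival * sqrt(variance_sum)))
    return out


results = life_table(table3)
for interval, row in zip(("0-1", "1-2", "2-3", "3-4", "4-5", "5-6"), results):
    print(interval, [round(value, 4) for value in row])
```

The effective numbers and survival estimates can be compared with the values quoted in the text and in table 5; small differences in the fourth decimal place reflect rounding.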
NONPARAMETRIC ESTIMATION OF A SURVIVAL
CURVE WITH COVARIATES

In most Navy manpower problems, we are interested not
only in survival probabilities but also in which
pre-service or in-service characteristics affect survival.
Nonparametric methods for estimating survival curves
while adjusting for covariates have been developed only
recently, however. At CNA, the most commonly used
methods of accounting for covariates have been probit
and logit analyses, although these methods are
parametric. Since these methods are used so frequently, we
shall include a brief discussion of them in this section
for the sake of completeness.

A summary of the methods to be discussed in this
section is provided below.

•  Probit analyses, logit analyses - give point
estimates of probability of survival; do not
utilize censored data.

•  Cox regression model - provides continuous
estimate of survival curve; utilizes censored
observations.

As  with  methods  for  estimating  survival  curves  without 
covariates,  this  is  not  an  all-inclusive  list  but  does 
summarize  the  most  commonly  used  methods. 

PROBIT  AND  LOGIT  ANALYSES 

Since both probit and logit analyses have been used
extensively at CNA, there are a number of CNA
publications describing these procedures (see references
1, 2, and 7, for example). For this reason we shall
describe these procedures here only briefly.

Suppose a dichotomous response is observed for each
individual, such as whether the individual stays in or
leaves the Navy at a particular point in time. Let this
response be denoted by $Y_i$, where $Y_i$ equals 1 if
individual i stays and 0 if he leaves.
Suppose that an individual's pre-service and in-service
characteristics are represented by a vector $Z_i$.
Then probit analysis models the expectation of $Y_i$
as

$$E(Y_i) = P(Y_i = 1) = p_i = \int_{-\infty}^{\beta' Z_i} d\Phi(x), \qquad (12)$$

where $\beta$ is a vector of coefficients to be determined
and $\Phi$ is the unit normal distribution function.
Similarly, the logit model is obtained by substituting
$e^{\beta' Z_i} / (1 + e^{\beta' Z_i})$ for the right-hand
side of (12). This closely approximates the normal
distribution except at the tails. The likelihood of the
observations is

$$L = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}, \qquad (13)$$

where n is the number of individuals (assumed to be
responding independently of one another). The vector
$\beta$ is then estimated by the maximum likelihood
solution to (13). The coefficients given by $\beta$ will
tell us how the variables in Z affect survival. Also, for
any set of covariates Z, substitution of the estimates for
$\beta$ into (12) will give us an estimate for the
probability of survival.
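
A minimal sketch of the probit and logit probabilities and of the logarithm of likelihood (13) is given below. It assumes the probit probability is $\Phi(\beta' Z_i)$, computed here from the error function; the function names and the small data set are hypothetical, and no fitting routine is shown.

```python
from math import erf, exp, log, sqrt


def probit_prob(beta, z):
    """Phi(beta'z), the unit normal distribution function at beta'z."""
    xb = sum(b * x for b, x in zip(beta, z))
    return 0.5 * (1 + erf(xb / sqrt(2)))


def logit_prob(beta, z):
    """Logistic counterpart: exp(beta'z) / (1 + exp(beta'z))."""
    xb = sum(b * x for b, x in zip(beta, z))
    return exp(xb) / (1 + exp(xb))


def log_likelihood(prob, beta, data):
    """Log of expression (13); data holds (z, y) pairs with y in {0, 1}."""
    total = 0.0
    for z, y in data:
        p = prob(beta, z)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total


# Hypothetical data: a constant term and one covariate per individual.
data = [([1.0, 0.0], 1), ([1.0, 1.0], 0), ([1.0, 2.0], 1)]
ll_probit = log_likelihood(probit_prob, [0.0, 0.0], data)
ll_logit = log_likelihood(logit_prob, [0.0, 0.0], data)
```

In practice the coefficient vector would be found by passing the negative of `log_likelihood` to a numerical optimizer; with all coefficients zero, both models assign probability 1/2 to every individual.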

Note that the estimate of the probability of survival is
for the point in time at which the $Y_i$ values were
calculated. If we wished to approximate a continuous
survival curve, we would have to perform a conditional
analysis as we did with the Life Table method,
estimating a different probit equation at each step. This
is very time-consuming, however, and has the additional
drawback of giving different estimates for $\beta$ over
time, which may be difficult to interpret.

THE  COX  REGRESSION  MODEL 

The Cox regression model was first proposed by Cox as
a quasi-nonparametric method for estimating a survival
curve while adjusting for factors which may affect it.
Until recently, it has been applied mainly in the
biological and health sciences, particularly in clinical
life studies. The method is termed quasi-nonparametric
because only weak assumptions are made about the form
of the survival distribution.

Before we proceed to describe the Cox model, recall the
definition of the hazard function given by expression
(7). The Cox model expresses the hazard function as

$$h_Z(t) = h_0(t)\, e^{\beta' Z}, \qquad (14)$$

where Z is a vector of covariates, $\beta$ is a vector of
unknown coefficients, and $h_0(t)$ is assumed to be
fixed and independent of Z, but otherwise completely
unspecified. Note that $h_0(t)$ corresponds to the
hazard function for the situation when Z = 0.

Some  properties  of  the  Cox  model  are  evident  from  the 
formulation  given  by  (14).  These  are: 

•  The effects of covariates (i.e., the $\beta$'s) are
constant over time.

•  Differences among the survival distributions of
individuals are caused only by changes in Z.

•  Expression (14) is a proportional hazards
model; i.e., the hazard for an individual with
covariate $Z_1$ is proportional to that for
an individual with covariate $Z_2$.

The last property can be seen by writing
$h_1(t) = h_0(t) e^{\beta' Z_1}$ and
$h_2(t) = h_0(t) e^{\beta' Z_2}$. Then

$$\frac{h_1(t)}{h_2(t)} = e^{\beta'(Z_1 - Z_2)},$$

and this expression is independent of time. In terms of
the survival functions, this relationship implies (from
expression (8)) that

$$S_1(t) = [S_2(t)]^{\exp\{\beta'(Z_1 - Z_2)\}},$$

so that one survival curve is a power of the other. In
some situations, the proportional hazards property may
be too restrictive. The Cox model can be modified,
however, to allow for nonproportional hazards, and we
shall discuss this modification later.
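
The power relationship between two survival curves can be verified numerically. In the sketch below the cumulative baseline hazard `H0`, the coefficient vector, and the two covariate vectors are all hypothetical choices made for illustration; any nondecreasing `H0` would serve.

```python
from math import exp


def survival(t, beta, z, cumulative_baseline_hazard):
    """S_Z(t) = exp(-H0(t) * exp(beta'z)), from expressions (8) and (14)."""
    xb = sum(b * x for b, x in zip(beta, z))
    return exp(-cumulative_baseline_hazard(t) * exp(xb))


def H0(t):
    """Hypothetical cumulative baseline hazard (integral of h0 up to t)."""
    return 0.1 * t ** 1.5


beta = [0.7, -0.3]
z1, z2 = [1.0, 2.0], [0.0, 1.0]
power = exp(sum(b * (a - c) for b, a, c in zip(beta, z1, z2)))

t = 3.0
lhs = survival(t, beta, z1, H0)
rhs = survival(t, beta, z2, H0) ** power
print(lhs, rhs)
```

The two printed values agree to rounding error for any choice of t, which illustrates that the ratio of hazards, and hence the exponent, does not depend on time.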

By keeping $h_0(t)$ arbitrary, Cox argues that no
information about $\beta$ can be contributed in intervals
in which no failures occur, since $h_0(t)$ might
conceivably be zero there. He uses this rationale to
justify the use of a conditional likelihood method based
on the set of observed failure times
$t_{(1)} < t_{(2)} < \dots < t_{(k)}$ from a sample of size
n. The logic behind using a conditional rather than the
usual unconditional method is to obtain a likelihood which
is functionally independent of $h_0(t)$, which enables us
to estimate $\beta$.


For each failure time $t_{(i)}$, define the risk set
$R_{(i)}$ as the set of individuals who are observed to
fail or are censored on or after $t_{(i)}$. Then
conditional on $R_{(i)}$, the probability that the failure
at $t_{(i)}$ is by the individual as observed may be shown
to be

$$\frac{h_0(t_{(i)})\, e^{\beta' Z_{(i)}}}{h_0(t_{(i)}) \sum_{l \in R_{(i)}} e^{\beta' Z_l}} = \frac{e^{\beta' Z_{(i)}}}{\sum_{l \in R_{(i)}} e^{\beta' Z_l}}, \qquad (15)$$

where $Z_{(i)}$ is the covariate corresponding to the
individual with observed time $t_{(i)}$. Thus the
conditional likelihood of the data is formed by taking the
product over all failure times of terms such as (15),
i.e.,

$$L = \prod_{i=1}^{k} \frac{e^{\beta' Z_{(i)}}}{\sum_{l \in R_{(i)}} e^{\beta' Z_l}}. \qquad (16)$$

Estimates of $\beta$ can then be obtained by a maximum
likelihood solution of (16). This, of course, has to be
done by numerical methods.
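
The logarithm of the conditional likelihood (16) can be evaluated with a short Python sketch. It assumes no tied failure times; the function name `cox_partial_loglik` and the single covariate attached to the failure and censoring times of the earlier P-L example are hypothetical.

```python
from math import exp, log


def cox_partial_loglik(beta, times, failed, covariates):
    """Log of expression (16), assuming no tied failure times.

    times[i]      observed time for individual i
    failed[i]     True for a failure, False for a censored observation
    covariates[i] list of covariate values for individual i
    """
    score = [sum(b * z for b, z in zip(beta, zi)) for zi in covariates]
    total = 0.0
    for i, t_i in enumerate(times):
        if not failed[i]:
            continue
        # Risk set: everyone who fails or is censored on or after t_i.
        risk = [exp(score[l]) for l, t_l in enumerate(times) if t_l >= t_i]
        total += score[i] - log(sum(risk))
    return total


times = [0.8, 1.0, 2.7, 3.1, 5.4, 7.0, 9.2, 12.1]
failed = [True, False, False, True, True, False, True, False]
covariates = [[1.0], [0.0], [1.0], [1.0], [0.0], [0.0], [1.0], [0.0]]  # hypothetical

at_zero = cox_partial_loglik([0.0], times, failed, covariates)
at_half = cox_partial_loglik([0.5], times, failed, covariates)
```

Note that the baseline hazard never appears in the function. A numerical optimizer applied to the negative of this quantity would give the estimate of the coefficient vector.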
The model (14) can be extended in a natural way to
include cases when the hazard functions cannot be
considered proportional to one another with respect to the
levels of some factor or combination of factors. Suppose
that there are s strata or blocks (defined by the
levels of some factor, such as education). The jth
stratum can be given its own basic hazard $h_j(t)$
with the dependence on the covariates assumed to be the
same for all strata. In the jth stratum, the hazard
function can be written as

$$h_Z(t) = h_j(t)\, e^{\beta' Z}$$

for an individual with covariate value Z. An analysis
of this model contributes s factors of the form (16) to
the conditional likelihood of $\beta$ (one from each
stratum).

Now let us consider a method for estimating the
underlying hazard function $h_0(t)$. The method yields a
continuous estimate of the survival function for the
case when $\beta$ is known. Since $\beta$ is, of course,
unknown, it is replaced by its maximum likelihood estimate
$\hat{\beta}$. We shall take $h_0(t)$ to be a step
function and proceed as was done with the Life Table
method, utilizing exact failure and censoring times.

Suppose $0 < \nu_1 < \nu_2 < \cdots < \nu_k$ are fixed,
predetermined constants and define the intervals $I_j$,
$j = 1, \ldots, k+1$, by $I_j = [\nu_{j-1}, \nu_j)$, where
$\nu_0 = 0$ and $\nu_{k+1} = \infty$. We shall assume that

\[
h_0(t) = \lambda_j \quad \text{for } t \in I_j .
\]

Next, suppose we observe $d_j$ failures in the interval $I_j$
at times $t_1^{(j)}, \ldots, t_{d_j}^{(j)}$ and $m_j$ censored
observations in the same interval at times
$s_1^{(j)}, \ldots, s_{m_j}^{(j)}$.

Let the covariate values corresponding to the failure times be
denoted by $Z_1^{(j)}, \ldots, Z_{d_j}^{(j)}$ and those
corresponding to the censored times be denoted by
$\tilde{Z}_1^{(j)}, \ldots, \tilde{Z}_{m_j}^{(j)}$. Define

\[
Q_j = \sum_{i=1}^{d_j} e^{\beta' Z_i^{(j)}} \left( t_i^{(j)} - \nu_{j-1} \right)
    + \sum_{i=1}^{m_j} e^{\beta' \tilde{Z}_i^{(j)}} \left( s_i^{(j)} - \nu_{j-1} \right)
\]

and

\[
R_j = \sum_{l=1}^{d_j} e^{\beta' Z_l^{(j)}}
    + \sum_{l=1}^{m_j} e^{\beta' \tilde{Z}_l^{(j)}} .
\]


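Then the maximum likelihood estimate of $\lambda_j$ can be
shown to be

\[
\hat{\lambda}_j = \frac{d_j}{Q_j + (\nu_j - \nu_{j-1}) \sum_{l=j+1}^{k+1} R_l},
\quad j = 1, \ldots, k,
\qquad
\hat{\lambda}_{k+1} = \frac{d_{k+1}}{Q_{k+1}} .
\tag{17}
\]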


From expressions (8) and (14), the maximum likelihood
estimate for the survival function is

\[
\hat{S}(t) = \exp\left\{ -e^{\hat{\beta}' Z}
  \left[ \sum_{i=1}^{j-1} \hat{\lambda}_i (\nu_i - \nu_{i-1})
         + \hat{\lambda}_j (t - \nu_{j-1}) \right] \right\},
\quad t \in I_j ,
\tag{18}
\]

for an individual with covariate value Z, where the
$\hat{\lambda}_i$ are given by expression (17).
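
The step-function estimate can be sketched in a few lines for a
single covariate. The sketch accumulates, for each interval, the
failure count and the risk-weighted time at risk, which is one
way of writing the ratio in (17), and then evaluates (18). The
function names, the toy data, and the use of a single covariate
are illustrative assumptions.

```python
import math
from bisect import bisect_right

def fit_step_hazard(cuts, times, events, z, beta):
    """Expression (17): failures divided by risk-weighted exposure.

    cuts = [nu_1, ..., nu_k]; interval j is [nu_{j-1}, nu_j), with
    nu_0 = 0 and the last interval unbounded above.
    """
    edges = [0.0] + list(cuts)      # left endpoints nu_0, ..., nu_k
    d = [0] * len(edges)            # failures d_j per interval
    exposure = [0.0] * len(edges)   # denominators of (17)
    for t, failed, z_i in zip(times, events, z):
        w = math.exp(beta * z_i)
        j = bisect_right(edges, t) - 1       # interval containing t
        if failed:
            d[j] += 1
        exposure[j] += w * (t - edges[j])    # contribution to Q_j
        for i in range(j):                   # intervals fully survived
            exposure[i] += w * (edges[i + 1] - edges[i])
    return [d_j / e_j if e_j > 0 else 0.0
            for d_j, e_j in zip(d, exposure)]

def survival(t, cuts, lam, beta, z_value):
    """Expression (18) for an individual with covariate value z_value."""
    edges = [0.0] + list(cuts)
    j = bisect_right(edges, t) - 1
    cum = sum(lam[i] * (edges[i + 1] - edges[i]) for i in range(j))
    cum += lam[j] * (t - edges[j])
    return math.exp(-math.exp(beta * z_value) * cum)

# Hypothetical toy data: one cut point at t = 2, beta taken as 0.
cuts = [2.0]
lam = fit_step_hazard(cuts, [1.0, 3.0, 4.0], [1, 1, 0], [0, 0, 0], 0.0)
s3 = survival(3.0, cuts, lam, 0.0, 0)
```

In practice $\hat{\beta}$ from the maximization of (16) would be
passed in place of the zero used here.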




COMPARISON  OF  THE  COX  AND  PROBIT  MODELS 


In order to examine the efficacy of using the Cox regression
model on a future cross-sectional data base, we will compare
it with probit analysis on the 1973 recruit cohort of 4-year
obligors. A longitudinal data base is necessary to make this
comparison, since probit analysis can be applied only to this
type of data.

Note that we are using the probit estimates as the standard of
comparison, not on any theoretical grounds, but on an
empirical basis, since probit analysis seems to have given
reasonable results when applied in previous CNA studies.

The 1973 cohort will be divided into two groups: those who
were promised or received an A-school assignment, and those
who were assigned to general detail (GENDETs). Separate
analyses will be performed for each group, since previous work
has shown that the effects of various pre-service and
in-service characteristics are quite different for the two
groups (see reference 8). When applying the Cox regression
model to these data, we will assume that the covariate effects
are constant over the 4-year period (the period for which
survival curves are obtained).

If this assumption is not approximately true, we can expect a
poor correspondence between the probit and Cox survival
curves. As will be seen, the Cox and probit curves agree
rather well, and so the assumption of constant covariate
effects seems to be reasonable.

The  covariates  which  we  shall  consider  in  our  analysis 
are  shown  below. 

PDEPS  = 1 if enlistee has primary dependents
       = 0 otherwise

RACE   = 1 if enlistee is non-Caucasian
       = 0 otherwise

MGRP   = 1 if enlistee is in mental group 3 lower or 4
       = 0 otherwise

EDUC   = 1 if enlistee has less than 12 years of education
       = 0 otherwise

AGE17  = 1 if enlistee is 17 years old
       = 0 otherwise

AGE19  = 1 if enlistee is 19 years old
       = 0 otherwise

AGE20P = 1 if enlistee is 20 or more years old
       = 0 otherwise.

The  base  group  of  individuals  corresponds  to  those 
having  a  value  of  0  for  each  of  the  variables  listed 
above. 

By performing separate probit analyses at each month,
considering only those recruits who survived the previous
month, we can estimate the effects on survival for each of the
7 covariates in each monthly interval. Since this involves
estimating 8 (a constant and seven covariates) x 48 (monthly
intervals) = 384 parameters for each group (A-schoolers and
GENDETs), we shall not display the parameter estimates here.
We show instead the probit survival curves for GENDETs and
A-schoolers in figures 1 and 4 respectively. The survival
curves in each of these 2 groups are further classified
according to quality levels A, B, C, and D, defined below,
where MG = mental group and LT12, GE12 denote recruits with
less than 12 and greater than or equal to 12 years of
education.


            GE12    LT12

MG1-3U       A       B

MG3L-4       C       D


It is not our intention here to analyze the results of these
probit analyses in detail, since previous CNA studies have
already covered this ground. We would like to note, however,
the advantage of obtaining entire survival curves as opposed
to point estimates of survival. This is evident on observing
the huge attrition rates after only 2 months of service
(corresponding to the end of boot camp) in the survival curves
for GENDETs. The times at which the losses occur can be of
great importance in formulating manpower policies and
procedures, whereas merely obtaining a point estimate of
survival at 1 year or 4 years, say, would yield very little
information on the patterns of attrition over time.
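
One natural reading of the month-by-month probit procedure
described above is that each monthly analysis yields a
conditional probability of surviving that month, and that the
survival curve is the running product of these probabilities.
The sketch below illustrates that reading only; the function
names and the coefficient values are hypothetical, and the
program actually used may differ in detail.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_survival_curve(monthly_coefs, covariates):
    """Chain monthly conditional survival probabilities into a curve.

    monthly_coefs[m] = (constant, [b_1, ..., b_p]) from the probit fitted
    to those recruits who survived through month m.
    """
    curve, surv = [], 1.0
    for const, coefs in monthly_coefs:
        index = const + sum(b * x for b, x in zip(coefs, covariates))
        surv *= norm_cdf(index)   # P(survive this month | survived so far)
        curve.append(surv)
    return curve

# Hypothetical coefficients for two months and one covariate.
curve = probit_survival_curve([(0.0, [0.0]), (0.0, [0.0])], [1])
```

With 48 months and 8 parameters per month, `monthly_coefs` would
hold the 384 estimates mentioned above for one group.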

Note that in each of figures 1 through 5, the curves show a
sudden dip at 45 months. This occurs as a result of the Navy's
early-out program, which allows individuals to leave the Navy
or reenlist up to 3 months before their initial term of
obligation expires.

The probit estimates of survival were obtained at great
computational expense. The computation of the GENDETs'
survival curves (corresponding to about 6,000 individuals)
took approximately one hour of processing time on a Burroughs
B6750 computer. For the roughly 35,000 A-schoolers, however,
it took nearly eleven hours of processing time. This amount of
computation is clearly undesirable and inefficient. On the
other hand, the survival curve estimates from the Cox
regression model took only 2 1/2 minutes for GENDETs and 17
minutes for A-schoolers.

Survival curves for GENDETs were calculated with both the
proportional hazards model of expression (14) and the
nonproportional hazards extension of that model discussed in
the previous section. The results are shown in figures 2 and
3. Estimates of the coefficients of the covariates described
earlier are shown in table 6 for GENDETs and in table 7 for
A-schoolers.


TABLE 6

COEFFICIENT ESTIMATES IN THE COX REGRESSION
MODEL FOR GENDETs

Variable    Coefficient    Standard deviation      χ²

PDEPS          0.048             0.053            0.83
RACE          -0.120             0.034           12.46
MGRP          -0.253             0.031           66.59
EDUC           0.207             0.030           47.61
AGE17          0.041             0.034            1.46
AGE19          0.015             0.041            0.14
AGE20P        -0.034             0.043            0.62


TABLE 7

COEFFICIENT ESTIMATES IN THE COX REGRESSION
MODEL FOR A-SCHOOLERS

Variable    Coefficient    Standard deviation      χ²

PDEPS          0.058             0.023            6.45
RACE          -0.011             0.021            0.25
MGRP           0.094             0.013           52.56
EDUC           0.197             0.015          172.92
AGE17          0.151             0.015          100.80
AGE19         -0.036             0.015            5.90
AGE20P        -0.077             0.016           23.04

The χ² values given in tables 6 and 7 should be compared with
the value 3.84, which is the upper 0.05 point (the 95th
percentile) of a χ² distribution with one degree of freedom.
Values greater than 3.84 mean that the coefficients
corresponding to these values are significantly different from
0 at the 0.05 level. The magnitudes and directions of the
estimated coefficients are similar to those obtained from
probit analysis.
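
The tabulated χ² values appear to be consistent, up to rounding,
with the square of each coefficient divided by its standard
deviation. The sketch below applies that assumed formula to the
table 6 entries and flags those exceeding 3.84. It also
exponentiates each coefficient, which under model (14) gives
the multiplicative effect on the hazard. The variable names are
ours.

```python
import math

# (coefficient, standard deviation) pairs copied from table 6 (GENDETs).
table6 = {
    "PDEPS": (0.048, 0.053),
    "RACE": (-0.120, 0.034),
    "MGRP": (-0.253, 0.031),
    "EDUC": (0.207, 0.030),
    "AGE17": (0.041, 0.034),
    "AGE19": (0.015, 0.041),
    "AGE20P": (-0.034, 0.043),
}
CRITICAL = 3.84  # upper 0.05 point of chi-square with 1 degree of freedom

def wald_chi2(coef, sd):
    """Assumed form of the tabulated statistic: (coefficient / sd) squared."""
    return (coef / sd) ** 2

significant = [name for name, (coef, sd) in table6.items()
               if wald_chi2(coef, sd) > CRITICAL]

# Multiplicative effect of each covariate on the hazard under model (14).
hazard_ratio = {name: math.exp(coef) for name, (coef, _) in table6.items()}
```

For GENDETs this flags RACE, MGRP, and EDUC, in agreement with
the χ² column of table 6.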

Returning to figures 2 and 3, we see that the estimates from
the nonproportional hazards version of the Cox model (figure
3) more closely resemble the probit estimates than do those
from the proportional hazards model (figure 2). Therefore the
assumption of proportional hazards is probably a bit too
strong for these data.

On the other hand, the assumption that the covariate effects
remain constant over time seems to be reasonable. Thus, the
nonproportional hazards version of the Cox model appears to
estimate survival probabilities rather well, at least for
GENDETs.

In  light  of  the  results  for  GENDETs,  we  decided  to  use 
the  nonproportional  hazards  version  of  the  Cox  model  in 
estimating  the  survival  curves  for  A-schoolers.  The 
results  are  shown  in  figure  5.  Again,  the  survival 
curve  estimates  from  the  Cox  model  closely  resemble 
those  obtained  from  the  probit  model. 

From an analysis of the 1973 cohort, we have seen that the Cox
model gives reasonable estimates of the survival curves for
GENDETs and A-schoolers. Furthermore, the results are more
easily interpretable (because of the assumption of constant
covariate effects over time) and computationally more
efficient than the probit counterparts. Thus, based on the
evidence presented here, the Cox model seems like a
potentially useful procedure for estimating survival from
cross-sectional data.

FIG. 1: PROBIT SURVIVAL CURVES FOR 4YO GENDETS

FIG. 3: COX SURVIVAL CURVES FOR 4YO GENDETS - NONPROPORTIONAL
HAZARDS MODEL

FIG. 5: COX SURVIVAL CURVES FOR 4YO A-SCHOOLERS - NONPROPORTIONAL
HAZARDS MODEL